How to represent a document?
What are vector space and bag-of-words models?
Features in text? And how to do text feature selection?
How to classify text data?
How to evaluate a classifier?
Supervised learning: Learning a function that maps an input to an output based on example input-output pairs.
infer a function from labeled training data
use the inferred function to label new instances
Human experts annotate a set of text data
Which problem is least likely to be a text classification task?
Author’s gender detection from text
Finding about the smoking conditions (yes/no) of patients from clinical letters
Grouping similar news articles
Classifying reviews into positive and negative sentiment
Represent by a string?
Represent by a list of sentences?
Represent by a vector?
A vector space is a collection of vectors
Represent documents by concept vectors
Each concept defines one dimension
k concepts define a high-dimensional space
Element of vector corresponds to concept weight
Distance between the vectors in this concept space
The process of converting text into numbers is called Vectorization
Terms are generic features that can be extracted from text
Typically, terms are single words, keywords, n-grams, or phrases
Documents are represented as vectors of terms
Each dimension (concept) corresponds to a separate term
\[d = (w_1, ..., w_n)\]
The Corpus represents a collection of documents (the dataset)
The Vocabulary is the set of all unique terms in the corpus
Bag of Words
Topics
Word Embeddings
With Bag of Words (BOW), we refer to a Vector Space Model where:
Terms: words (more generally we may use n-grams, etc.)
Weights: number of occurrences of the terms in the document
Term as the basis for vector space
Doc1: Text mining is to identify useful information.
Doc2: Useful information is mined from text.
Doc3: Apple is delicious.
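As a minimal sketch (in Python rather than the course's R, with all names hypothetical), the three example documents can be turned into bag-of-words count vectors like this:

```python
# Minimal bag-of-words sketch for the three example documents.
docs = [
    "Text mining is to identify useful information.",
    "Useful information is mined from text.",
    "Apple is delicious.",
]

# Tokenize: lowercase and strip trailing punctuation.
tokenized = [[w.strip(".").lower() for w in d.split()] for d in docs]

# The vocabulary is the set of all unique terms in the corpus.
vocab = sorted({w for doc in tokenized for w in doc})

# Each document becomes a vector of raw term counts over the vocabulary.
vectors = [[doc.count(t) for t in vocab] for doc in tokenized]

for v in vectors:
    print(v)
```

Note how "apple" produces a 1 only in Doc3's vector, and 0 elsewhere: each dimension corresponds to one term of the shared vocabulary.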
Zipf’s law
Source: https://en.wikipedia.org/wiki/Zipf's_law
Binary weights: 1 if the term occurs in the document, 0 otherwise
Idea: a term is more important if it occurs more frequently in a document
TF Formulas
Let \(c(t,d)\) be the frequency count of term \(t\) in doc \(d\)
Raw TF: \(tf(t,d) = c(t,d)\)
Let \(n_{d,t}\) denote the number of times the \(t\)-th term appears in the \(d\)-th document.
\[TF_{d,t} = \frac{n_{d,t}}{\sum_i{n_{d,i}}}\] Let \(N\) denote the number of documents and \(N_t\) denote the number of documents containing the \(t\)-th term.
\[IDF_t = log(\frac{N}{N_t})\] TF-IDF weight:
\[w_{d,t} = TF_{d,t} \cdot IDF_t\]
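A minimal Python sketch of these formulas on a hypothetical toy corpus (the natural log is used here; the log base is only a convention):

```python
import math

# Toy corpus: each document is a list of terms.
docs = [["text", "mining", "useful"],
        ["useful", "information", "text"],
        ["apple"]]

N = len(docs)  # number of documents

def tf(term, doc):
    # TF_{d,t} = n_{d,t} / sum_i n_{d,i}
    return doc.count(term) / len(doc)

def idf(term):
    # IDF_t = log(N / N_t), with N_t = number of docs containing the term
    n_t = sum(1 for d in docs if term in d)
    return math.log(N / n_t)

def tf_idf(term, doc):
    # w_{d,t} = TF_{d,t} * IDF_t
    return tf(term, doc) * idf(term)

# "apple" occurs in 1 of 3 docs, so it is weighted higher than "text",
# which occurs in 2 of 3 docs.
print(tf_idf("apple", docs[2]), tf_idf("text", docs[0]))
```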
library(tm)
## Loading required package: NLP
data <- c('Text mining is one of the Utrecht summer school courses.',
'There are other data science courses in Utrecht summer school')
# convert data to vector space model
corpus <- VCorpus(VectorSource(data))
# create a dtm object
dtm <- DocumentTermMatrix(corpus,
list(removePunctuation = TRUE,
stopwords = TRUE,
stemming = TRUE,
removeNumbers = TRUE))
inspect(dtm)
## <<DocumentTermMatrix (documents: 2, terms: 9)>>
## Non-/sparse entries: 13/5
## Sparsity           : 28%
## Maximal term length: 7
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs cours data mine one school scienc summer text utrecht
##    1     1    0    1   1      1      0      1    1       1
##    2     1    1    0   0      1      1      1    0       1
Feature selection is the process of selecting a specific subset of the terms of the training set and using only them in the classification algorithm.
high dimensionality of text features
Select the most informative features for model training
Reduce noise in feature representation
Improve final classification performance
Improve training/testing efficiency
Lower time complexity
Less training data needed
Wrapper method
Filter method
Evaluate the features independently from the classifier and other features
No indication of a classifier’s performance on the selected features
No dependency among the features
Feasible for very large feature set
Let \(p(c | t)\) be the conditional probability that a document belongs to class \(c\), given the fact that it contains the term \(t\). Therefore, we have:
\[\sum^k_{c=1}{p(c | t)=1}\]
Then, the gini-index for the term \(t\), denoted by \(G(t)\) is defined as:
\[G(t) = \sum^k_{c=1}{p(c | t)^2}\]
The value of the gini-index lies in the range \((1/k, 1)\).
Higher values of the gini-index indicate a greater discriminative power of the term t.
If the global class distribution is skewed, the gini-index may not accurately reflect the discriminative power of the underlying attributes.
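A minimal sketch of the gini-index computation, assuming the conditional probabilities \(p(c|t)\) have already been estimated (names are hypothetical):

```python
# Gini-index of a term t from its estimated class probabilities P(c|t).
def gini(p_c_given_t):
    # G(t) = sum_c P(c|t)^2 ; values lie between 1/k and 1
    return sum(p * p for p in p_c_given_t)

# A term seen only in one class is maximally discriminative:
print(gini([1.0, 0.0]))   # G(t) = 1
# A term spread evenly over k = 2 classes is least discriminative:
print(gini([0.5, 0.5]))   # G(t) = 1/k = 0.5
```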
\({\chi}^2\)-Statistic
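The \({\chi}^2\)-statistic measures the dependence between a term and a class. As a hedged sketch using the standard 2x2 contingency-table formulation (not taken from these slides; counts are hypothetical):

```python
def chi2(n11, n10, n01, n00):
    # 2x2 contingency table for term t and class c:
    # n11 = docs with t in c,       n10 = docs with t not in c,
    # n01 = docs without t in c,    n00 = docs without t not in c.
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return num / den

# A term that appears only in documents of the class scores high:
print(chi2(40, 0, 0, 60))
# A term independent of the class scores zero:
print(chi2(25, 25, 25, 25))
```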
Input:
A training set of \(m\) manually-labeled documents \((d_1,c_1),\cdots,(d_m,c_m)\)
a fixed set of classes \(C = \{c_1, c_2,…, c_J\}\)
Output: a learned classifier \(\gamma: d \rightarrow c\)
Rules based on combinations of words or other features
Rules carefully refined by expert
But building and maintaining these rules is expensive
Data/Domain specifics
Not recommended!
Using a training set of documents, the Rocchio algorithm builds a prototype vector, or centroid, for each class: the average of the vectors of the training documents that belong to that class. A test sample is then classified to the class with the nearest centroid.
\[\boldsymbol{\mu_c} = \frac{1}{|D_c|}\sum_{\mathbf{d} \in D_c}{\mathbf{d}}\]
where \(D_c\) is the set of documents in the corpus that belong to class \(c\) and \(\mathbf{d}\) is the vector representation of a document.
The predicted label of document \(\mathbf{d}\) is the class whose centroid has the smallest (Euclidean) distance to the document:
\[\hat{c} = \arg \min_c ||\boldsymbol{\mu_c} - \mathbf{d}||\]
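A minimal nearest-centroid sketch in Python, using hypothetical 2-D document vectors for illustration:

```python
import math

# Toy training set: class -> list of document vectors.
train = {
    "sports":   [[1.0, 0.0], [0.9, 0.1]],
    "politics": [[0.0, 1.0], [0.1, 0.9]],
}

# mu_c = mean of the document vectors in class c
centroids = {
    c: [sum(x) / len(vs) for x in zip(*vs)]
    for c, vs in train.items()
}

def predict(d):
    # assign d to the class with the nearest (Euclidean) centroid
    return min(centroids, key=lambda c: math.dist(d, centroids[c]))

print(predict([0.8, 0.2]))   # -> sports
```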
Given a test document \(d\), the KNN algorithm finds the \(k\) nearest neighbors of \(d\) among all the documents in the training set, and scores the candidate categories based on the classes of these \(k\) neighbors.
After sorting the score values, the algorithm assigns the candidate to the class with the highest score.
The basic nearest neighbors classification uses uniform weights: the value assigned to a query point is computed from a simple majority vote of its nearest neighbors.
Alternatively, the neighbors can be weighted so that nearer neighbors contribute more to the fit.
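A sketch of KNN voting with uniform or distance-based weights, on a hypothetical toy training set:

```python
import math
from collections import Counter

# Toy training set: (vector, label)
train = [([1.0, 0.0], "pos"), ([0.9, 0.2], "pos"),
         ([0.0, 1.0], "neg"), ([0.2, 0.9], "neg")]

def knn(d, k=3, weighted=False):
    # sort training docs by distance to the query document d
    neighbors = sorted(train, key=lambda xy: math.dist(d, xy[0]))[:k]
    scores = Counter()
    for vec, label in neighbors:
        # uniform vote, or 1/distance so nearer neighbors count more
        scores[label] += 1.0 / (math.dist(d, vec) + 1e-9) if weighted else 1.0
    return scores.most_common(1)[0][0]

print(knn([0.8, 0.1], k=3, weighted=True))   # -> pos
```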
Applied to documents and classes
For a document \(d\) and a class \(c\)
\[P(c|d) = \frac{P(d|c)P(c)}{P(d)}\]
Bag of Words assumption: Assume position doesn’t matter
Conditional Independence: Assume the feature probabilities \(P(w_i|c_j)\) are independent given the class \(c\).
\[P(w_1, \ldots, w_n|c) = P(w_1 | c) \cdot P(w_2|c) \cdot P(w_3|c) \cdot \ldots \cdot P(w_n|c)\]
\[C_{MAP} = \underset{c \in C}{\operatorname{argmax}}P(w_1, w_2, \ldots,w_n|c)P(c)\]
\[C_{NB} = \underset{c_j \in C}{\operatorname{argmax}}\,P(c_j)\prod_{w \in V}{P(w|c_j)}\]
\[C_{NB} = \underset{c_j \in C}{\operatorname{argmax}}\,P(c_j)\prod_{i \in positions}{P(w_i|c_j)}\]
First attempt: maximum likelihood estimates
\[\hat{P}(c_j) = \frac{doccount(C = c_j)}{N_{doc}}\] \[\hat{P}(w_i|c_j) = \frac{count(w_i, c_j)}{\sum_{w \in V}count(w, c_j)}\]
What if we have seen no training documents with the word coffee and classified in the topic positive (thumbs-up)?
\[\hat{P}(\mathrm{coffee} \mid \mathrm{positive}) = \frac{count(\mathrm{coffee},\ \mathrm{positive})}{\sum_{w \in V}{count(w,\ \mathrm{positive})}}\]
Zero probabilities cannot be conditioned away, no matter the other evidence!
\[C_{MAP} = \underset{c}{\operatorname{argmax}}\hat{P}(c)\prod_i{\hat{P}(w_i|c)}\]
\[ \begin{align} \hat{P}(w_i|c) &= \frac{count(w_i,c)+1}{\sum_{w \in V}{(count(w,c)+1)}} \\ &= \frac{count(w_i,c)+1}{(\sum_{w \in V}{count(w,c) + |V|})} \end{align} \]
\(docs_j\leftarrow\) all docs with class = \(c_j\)
\[P(c_j) \leftarrow \frac{|docs_j|}{\mathrm{total\ \#\ documents}}\]
\(Text_j \leftarrow\) single doc containing all \(docs_j\)
For each word \(w_k\) in Vocabulary
\(n_k \leftarrow\) # of occurrences of \(w_k\) in \(Text_j\)
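The training and prediction steps above can be sketched in Python, on a hypothetical toy corpus, with add-one smoothing as in the earlier formula:

```python
import math
from collections import Counter

# Tiny labeled corpus: (tokens, class)
train = [(["good", "coffee"], "pos"),
         (["great", "coffee"], "pos"),
         (["bad", "taste"], "neg")]

classes = {c for _, c in train}
vocab = {w for doc, _ in train for w in doc}

# Priors: P(c_j) = |docs_j| / N_doc
prior = {c: sum(1 for _, y in train if y == c) / len(train) for c in classes}
# Word counts per class (n_k in the pseudocode above)
counts = {c: Counter(w for doc, y in train if y == c for w in doc)
          for c in classes}
total = {c: sum(counts[c].values()) for c in classes}

def log_p(w, c):
    # Add-one (Laplace) smoothing:
    # P(w|c) = (count(w,c) + 1) / (sum_w count(w,c) + |V|)
    return math.log((counts[c][w] + 1) / (total[c] + len(vocab)))

def classify(doc):
    # C_NB = argmax_c  log P(c) + sum_i log P(w_i|c)
    return max(classes, key=lambda c: math.log(prior[c])
               + sum(log_p(w, c) for w in doc if w in vocab))

print(classify(["good", "coffee"]))   # -> pos
```

Thanks to smoothing, a word unseen in a class (such as "taste" under "pos") still gets a small nonzero probability instead of zeroing out the whole product.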
With the kernel trick, SVM can construct a nonlinear decision surface in the original feature space by mapping the data instances non-linearly to a new space where the classes can be separated linearly with a hyperplane.
SVM is quite robust to high dimensionality.
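A tiny numerical illustration of the kernel-trick idea (illustrative values, not an actual SVM implementation): data that is not linearly separable on the line becomes separable after the non-linear map \(\phi(x) = (x, x^2)\).

```python
# Points at x = -1, 1 (class A) vs x = -2, 2 (class B) cannot be split
# by a single threshold on the line, but after mapping phi(x) = (x, x^2)
# the hyperplane x^2 = 2.5 separates them linearly.
def phi(x):
    return (x, x * x)

class_a = [phi(x) for x in (-1.0, 1.0)]
class_b = [phi(x) for x in (-2.0, 2.0)]

# In the mapped space, a threshold on the second coordinate suffices:
assert all(p[1] < 2.5 for p in class_a)
assert all(p[1] > 2.5 for p in class_b)
print("linearly separable after mapping")
```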
A decision tree is a hierarchical decomposition of the (training) data space, in which a condition on the feature value is used in order to divide the data space hierarchically.
Training set
Test set
Validation set (dev set)
https://scikit-learn.org/stable/modules/cross_validation.html
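A minimal hand-rolled k-fold split sketch (the scikit-learn page above provides ready-made utilities such as `KFold` and `cross_val_score`; names here are hypothetical):

```python
# Split a list of example indices into k train/test folds.
def k_fold(indices, k):
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

data = list(range(10))
for train, test in k_fold(data, k=5):
    # every item appears in exactly one test fold, never in both sets
    assert set(train) | set(test) == set(data)
    assert not set(train) & set(test)
print("5 folds, each item held out exactly once")
```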
Accuracy is a valid choice of evaluation for classification problems which are well balanced and not skewed.
Suppose the target class is very sparse: do we still want accuracy as the metric of model performance? If we are predicting whether an asteroid will hit the earth, a model that always says "No" will be about 99% accurate. Such a model can be reasonably accurate, but not at all valuable.
Precision: % of selected items that are correct
Recall: % of correct items that are selected
Precision is a valid choice of evaluation metric when we want to be very sure of our prediction.
Recall is a valid choice of evaluation metric when we want to capture as many positives as possible.
A combined measure that assesses the P/R tradeoff is F measure (weighted harmonic mean):
\[F = \frac{1}{\alpha\frac{1}{P}+(1-\alpha)\frac{1}{R}}=\frac{(\beta^2+1)PR}{\beta^2P + R}\]
The harmonic mean is a very conservative average; see IIR § 8.3
Balanced F1 measure - i.e., with \(\beta = 1\) (that is, \(\alpha = 1/2\)): \(F = 2PR/(P+R)\)
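These metrics can be sketched directly from label lists (example data is hypothetical; the positive class is 1):

```python
# Precision, recall, and balanced F1 from true and predicted labels.
def prf(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)   # % of selected items that are correct
    recall = tp / (tp + fn)      # % of correct items that are selected
    f1 = 2 * precision * recall / (precision + recall)  # beta = 1
    return precision, recall, f1

p, r, f = prf([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
print(p, r, f)   # 2/3, 2/3, 2/3
```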
Manually written rules
If (x or y) and not (w or z) then categorize as class1
Need careful crafting
Low accuracy
Domain-specific
Time-consuming
Active learning
Unsupervised methods
Use Naïve Bayes, KNN, Rocchio
Get more labeled data
Find ways to label data for you
Try semi-supervised methods:
Perfect for all the complex classifiers
SVM
Regularized Logistic Regression
Random forest
Can achieve high accuracy!
At a cost:
With enough data
Domain-specific features and weights: very important in real performance
Sometimes need to collapse terms:
Part numbers, chemical formulas, …
But stemming generally doesn’t help
Title words
First sentence of each paragraph (Murata, 1999)
In sentences that contain title words
Corpus: a large and structured set of texts
Stop words: words which are filtered out before or after processing of natural language data (text)
Unstructured text: information that either does not have a pre-defined data model or is not organized in a pre-defined manner
Tokenizing: the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens (see also lexical analysis)
Natural language processing: a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages
Term document (or document term) matrix: a mathematical matrix that describes the frequency of terms that occur in a collection of documents
Supervised learning: the machine learning task of inferring a function from labeled training data
Unsupervised learning: finding hidden structure in unlabeled data
Vector space model & BOW
Feature Selection
Text Classification
Evaluation